Title: Is synthetic data useful for science? From statistical inference to hypothesis selection
About the Talk:
Generative models can be used as simulators to provide synthetic data, with use cases proposed across many scientific domains. For example, social scientists have explored simulating survey respondents using an LLM. Proponents hope that synthetic data will allow scientists to make inferences from less (or perhaps even no) real data, improving the efficiency of costly experiments. The challenge is that we have no guarantee that samples from a generative model reflect the distribution of interest. This talk will explore how much synthetic data can help under such conditions.
The speaker will start by presenting statistical methods for performing inference on an unknown parameter using a combination of real and synthetic samples. He will show that, by generating paired real-synthetic samples (e.g., via few-shot prompting of an LLM), an appropriately designed method-of-moments estimator can learn from the correlation structure between them. The estimator provides standard inferential guarantees, such as asymptotic normality and valid confidence intervals, regardless of the quality of the synthetic data, and it improves in efficiency when the synthetic data is informative. While this allows synthetic data to be safely used to "amplify" small amounts of real data, many proposals situate generative models earlier in the scientific pipeline. For example, should we use language models to screen potential hypotheses before collecting data at all? The speaker will present a decision-theoretic view of this process in a stylized model, showing what characteristics generative models should have to accelerate scientific discovery.
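To make the idea of combining paired real and synthetic samples concrete, here is a minimal Python sketch. It is not the estimator from the talk: it swaps in a simple control-variate estimator of a mean, chosen only because it shares the same flavor (the real-data estimate is corrected using the correlation between paired samples). The function name `combined_mean_estimate`, the existence of an extra pool of unpaired synthetic samples, and the assumption that this pool is drawn independently from the same distribution as the paired synthetic samples are all illustrative assumptions. The confidence interval is a large-sample normal approximation.

```python
import math
import random
import statistics


def combined_mean_estimate(real, paired_synth, extra_synth, z=1.96):
    """Control-variate estimate of the mean of `real` (hypothetical helper).

    real[i] and paired_synth[i] form a pair; extra_synth is assumed to be an
    independent pool drawn from the same distribution as paired_synth.
    Returns (estimate, (lower, upper)) using a normal approximation.
    """
    n, big_n = len(real), len(extra_synth)
    mean_real = statistics.fmean(real)
    mean_paired = statistics.fmean(paired_synth)
    mean_extra = statistics.fmean(extra_synth)

    var_real = statistics.variance(real)
    var_synth = statistics.variance(paired_synth)
    cov = sum(
        (y - mean_real) * (s - mean_paired) for y, s in zip(real, paired_synth)
    ) / (n - 1)

    # Plug-in coefficient that minimizes the variance of the corrected mean.
    # If the synthetic data is uncorrelated with the real data, lam is near 0
    # and the estimate falls back to the plain real-data mean.
    lam = cov / (var_synth * (1 + n / big_n)) if var_synth > 0 else 0.0

    # The correction term has mean zero when both synthetic samples share a
    # distribution, so it does not bias the estimate for a fixed lam.
    estimate = mean_real + lam * (mean_extra - mean_paired)

    var_est = (
        var_real / n
        - 2 * lam * cov / n
        + lam**2 * var_synth * (1 / n + 1 / big_n)
    )
    half_width = z * math.sqrt(max(var_est, 0.0))
    return estimate, (estimate - half_width, estimate + half_width)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data: synthetic values are a biased, noisy copy of the real ones.
    real = [rng.gauss(0.0, 1.0) for _ in range(50)]
    paired = [y + rng.gauss(0.5, 0.3) for y in real]
    extra = [rng.gauss(0.0, 1.0) + rng.gauss(0.5, 0.3) for _ in range(5000)]
    print(combined_mean_estimate(real, paired, extra))
```

In this sketch, the bias of the synthetic samples cancels because they enter only through the difference between two synthetic means, while the interval narrows relative to the real-only interval as the correlation between the paired samples grows.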
Language: English
About the Speaker: Bryan Wilder is an Assistant Professor in the Machine Learning Department at Carnegie Mellon University and Director of the Lab for AI and Social Impact. His research focuses on the foundations of machine learning in social, policy, and healthcare settings, combining new methods with field evaluations to improve AI’s real-world impact. He collaborates with governments, nonprofits, and health systems, and his work has been supported by Schmidt Sciences, NSF, NIH, CDC, the Engler Family Foundation, and ARO. He earned a PhD in Computer Science from Harvard and was a Schmidt Science Fellow at Harvard School of Public Health. He currently serves as Chair of the Board for EAAMO and the ACM EAAMO conference.